Introduction

Background & Context

Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the different fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.

Customers leaving the credit card service would cause the bank to lose revenue. The bank therefore wants to analyze its customer data, identify the customers who are likely to leave and the reasons why, and improve in those areas.

As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

Objectives

  1. Explore and visualize the dataset.
  2. Build a classification model to predict whether a customer is going to churn.
  3. Optimize the model using appropriate techniques.
  4. Generate a set of insights and recommendations that will help the bank.

Data Dictionary

  1. CLIENTNUM: Client number. Unique identifier for the customer holding the account
  2. Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
  3. Customer_Age: Age in Years
  4. Gender: Gender of the account holder
  5. Dependent_count: Number of dependents
  6. Education_Level: Educational qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate.
  7. Marital_Status: Marital Status of the account holder
  8. Income_Category: Annual Income Category of the account holder
  9. Card_Category: Type of Card
  10. Months_on_book: Period of relationship with the bank
  11. Total_Relationship_Count: Total no. of products held by the customer
  12. Months_Inactive_12_mon: No. of months inactive in the last 12 months
  13. Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
  14. Credit_Limit: Credit Limit on the Credit Card
  15. Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
  16. Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
  17. Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  18. Total_Trans_Ct: Total Transaction Count (Last 12 months)
  19. Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter
  20. Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter
  21. Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

Problem Statement

The goal is to build a classification model that identifies the profile of Thera Bank customers who are likely to leave its credit card services, so that the bank can identify aspects it can improve. Customers who churn from the credit card services cause the bank to lose revenue. Hence, during model building we will optimize for recall in order to minimize false negatives (customers who churn but are predicted to stay), which are costly to the bank.
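To make the choice of metric concrete, here is a minimal sketch of how recall is computed from the confusion counts. The labels are hypothetical (1 = attrited customer, 0 = existing customer) and are not taken from the dataset.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1
        elif actual != positive and predicted == positive:
            fp += 1
        elif actual == positive and predicted != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def recall(true_positives, false_negatives):
    """Recall = TP / (TP + FN): the share of actual churners the model catches."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        raise ValueError("recall is undefined when there are no actual positives")
    return true_positives / actual_positives


# Hypothetical labels: 1 = attrited customer, 0 = existing customer.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
churn_recall = recall(tp, fn)          # one churner was missed (a false negative)
accuracy = (tp + tn) / len(y_true)     # accuracy alone hides that miss
```

In this toy case the accuracy is higher than the recall, which is why accuracy on its own can look reassuring while churners are still being missed.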

Data & Library Importing and Checking

There are no missing values or erroneous names in the data. Let's do an exploratory data analysis to check whether there are outliers.

Exploratory Data Analysis

Univariate Analysis

From the univariate analysis above, we can see that there are outliers in the data. Some observations include:

  1. Attrition_Flag, the target variable, has many more existing customers than attrited customers.
  2. The blue card category has the most customers.
  3. Customer_Age is approximately normally distributed.
  4. Dependent_count looks approximately normally distributed.
  5. Around a quarter of the Months_on_book values fall between 35 and 37.
  6. Total_Relationship_Count has a relatively uniform distribution.
  7. Months_Inactive_12_mon values are mostly between 1 and 3.
  8. Contacts_Count_12_mon is approximately normally distributed, with most values between 2 and 3.
  9. Credit_Limit is right-skewed.
  10. Around a quarter of the Total_Revolving_Bal values are 0.
  11. Avg_Open_To_Buy is right-skewed.
  12. Total_Amt_Chng_Q4_Q1 is approximately normally distributed, with a few outliers.
  13. Total_Trans_Amt has four peaks.
  14. Total_Trans_Ct has two peaks.
  15. Total_Ct_Chng_Q4_Q1 is approximately normally distributed, with a few outliers.
  16. Around a quarter of the Avg_Utilization_Ratio values are 0.
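One common way to flag the outliers mentioned above is the 1.5 × IQR rule that box plots use for their whiskers. The sketch below assumes that rule; the notebook may have identified outliers differently, and the credit limit values are hypothetical.

```python
import statistics


def iqr_outliers(values, whisker=1.5):
    """Return the values lying outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical, right-skewed Credit_Limit values (not taken from the dataset).
credit_limits = [1400, 1800, 2500, 3200, 4100, 5000, 6300, 34500]
outliers = iqr_outliers(credit_limits)
```

For a right-skewed variable such as Credit_Limit, this rule tends to flag only values in the long upper tail, as it does here.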

Bivariate Analysis

From the heatmap, we can see that:

  1. Credit_Limit and Avg_Open_To_Buy are very strongly correlated. One of them should be removed to avoid multicollinearity in the model.
  2. Total_Trans_Ct and Total_Trans_Amt are strongly correlated.
  3. Months_on_book and Customer_Age are strongly correlated.
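The heatmap observations above can be turned into a simple check: compute the Pearson correlation for every pair of numerical columns and list the pairs above a cutoff. This is only a sketch. The 0.9 cutoff and the column values are hypothetical, and a notebook would more likely use a dataframe correlation method than this hand-written function.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """List (column_a, column_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Hypothetical values, chosen only to illustrate the check.
columns = {
    "Credit_Limit": [2000, 5000, 8000, 12000, 30000],
    "Avg_Open_To_Buy": [1200, 4100, 6500, 11000, 28500],
    "Total_Trans_Ct": [60, 35, 80, 42, 55],
}
flagged = highly_correlated_pairs(columns)
```

For each flagged pair, one column is a candidate for removal before fitting a linear model such as logistic regression.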

From bivariate analysis of categorical data, we can see that:

  1. The percentage of attrited customers is higher among female customers than among male customers.
  2. The percentage of attrited customers is highest among doctorate holders, and it is also quite high among post-graduates.
  3. The percentage of attrited customers is higher among platinum card holders than among holders of the other card categories.
  4. Apart from the cases above, the percentage of attrited customers does not differ much across categories.
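The percentages compared above are attrition rates per category, that is, attrited customers divided by all customers in that category. A minimal sketch of the calculation follows. The records are hypothetical and only the column names and the "Attrited Customer" / "Existing Customer" labels come from the data dictionary.

```python
from collections import defaultdict


def attrition_rate_by(records, category_key,
                      target_key="Attrition_Flag",
                      positive="Attrited Customer"):
    """Return {category value: share of records in it that are attrited}."""
    totals = defaultdict(int)
    attrited = defaultdict(int)
    for row in records:
        group = row[category_key]
        totals[group] += 1
        if row[target_key] == positive:
            attrited[group] += 1
    return {group: attrited[group] / totals[group] for group in totals}


# Hypothetical records; the gender codes "F" and "M" are an assumption.
records = (
    [{"Gender": "F", "Attrition_Flag": "Attrited Customer"}] * 2
    + [{"Gender": "F", "Attrition_Flag": "Existing Customer"}] * 3
    + [{"Gender": "M", "Attrition_Flag": "Attrited Customer"}] * 1
    + [{"Gender": "M", "Attrition_Flag": "Existing Customer"}] * 4
)
rates = attrition_rate_by(records, "Gender")
```

Comparing rates rather than raw counts matters here because the categories differ a lot in size.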

From bivariate analysis of numerical data, we can see that:

  1. Attrited customers generally have lower Total_Relationship_Count
  2. Attrited customers have slightly higher number of Months_Inactive_12_mon
  3. Attrited customers have lower Total_Revolving_Bal
  4. Attrited customers have lower Total_Trans_Amt
  5. Attrited customers have lower Total_Trans_Ct
  6. Attrited customers have lower Total_Ct_Chng_Q4_Q1
  7. Attrited customers have lower Avg_Utilization_Ratio

These observations hint at the factors that affect the target variable most, but let's check whether they hold up in the modelling below.

Data Preparation

Model Building

Logistic Regression

As discussed above, the metric of interest in this problem is recall.

From the tests above, we can see that regularization does not improve the model. This is probably because the coefficients of the model before regularization do not differ that much in the first place.

From the various models tested above, it can be concluded that among the logistic regression models, the most reliable one is logistic regression with imbalanced-learn (imblearn) random oversampling and without regularization, due to its high recall value. The model's accuracy and F1 score are also the most reliable.
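For readers unfamiliar with random oversampling, the idea is to duplicate randomly chosen minority-class rows until every class is as large as the majority class. The sketch below illustrates that idea with the standard library only. It is not imbalanced-learn's implementation, and the function name, seed and toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until all classes are equally large."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_rows, new_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, y in zip(rows, labels) if y == label]
        # Sample with replacement; k is 0 for the majority class.
        extra = rng.choices(pool, k=target - count)
        new_rows.extend(extra)
        new_labels.extend([label] * len(extra))
    return new_rows, new_labels


# Hypothetical training data: 8 existing customers (0) and 2 attrited (1).
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
class_counts = Counter(balanced_labels)
```

Oversampling should be applied to the training split only, so that the test set keeps the original class balance and no duplicated row appears on both sides of the split.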

Bagging

As expected, the performance metrics of the decision tree improved. However, adding max_depth in effect counts as tuning the decision tree model. Hence, we will include only the basic decision tree model (without the max_depth limitation) when comparing it with the other untuned bagging and boosting models to find the best three models.

Boosting

Finding Top 3 Best Models for Recall

Pipelines and cross-validation are used.

From the comparison of the metrics, the three best-performing models are XGBoost, Gradient Boosting, and AdaBoost. Hence, we will focus the tuning with GridSearchCV and RandomizedSearchCV on these three models to maximize recall.
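Because the target classes are imbalanced, cross-validation for a recall comparison like the one above is usually stratified, so that every fold contains a similar share of attrited customers. The sketch below shows the idea without shuffling. It is an illustration under that assumption, not the splitter the notebook used, and the labels are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits=5):
    """Assign sample indices to folds so each fold keeps the class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal each class's indices round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


def train_validation_splits(folds):
    """Yield (train_indices, validation_indices), one pair per fold."""
    for k, validation in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, validation


# Hypothetical labels: 10 attrited customers (1) and 40 existing customers (0).
labels = [1] * 10 + [0] * 40
folds = stratified_folds(labels, n_splits=5)
splits = list(train_validation_splits(folds))
```

Each model is then fitted on the training indices and scored for recall on the validation indices, and the scores are averaged across the folds.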

Model Tuning - GridSearchCV

Note: Tuning was done by running the code repeatedly with different parameter values. The code shown reflects the parameters closest to the ideal for the best model, after several test runs in the Jupyter Notebook cell.

AdaBoost

Gradient Boosting

XGBoost

Due to time constraints, the parameters found by the RandomizedSearchCV optimization are used as the starting point. GridSearchCV is then applied to tune the XGBoost model further.

Bagging Classifier (added as the fourth-best model because the XGBoost GridSearchCV model is an update of the RandomizedSearchCV model)

Model Tuning - RandomizedSearchCV

AdaBoost

Gradient Boosting

XGBoost

Model Comparison, Best Model Selection and Feature Importances

We can see that XGBoost with RandomizedSearchCV is the best model, followed by XGBoost with GridSearchCV, because these models give the highest recall on the test set. As discussed earlier, recall is the metric of interest because false negatives would be very costly to the bank. Let's look at the feature importances of the best model below.

The most important feature is Total_Trans_Ct, followed by Total_Ct_Chng_Q4_Q1 and Total_Trans_Amt. Other features that visibly contribute to the model are Total_Revolving_Bal, Total_Relationship_Count, Months_Inactive_12_mon, Total_Amt_Chng_Q4_Q1, and Avg_Utilization_Ratio. This agrees with the bivariate analysis: the model supports the observation that these factors are strongly related to the target variable.

Let's see the comparison of time taken between the models tested using both GridSearchCV and RandomizedSearchCV below:

From the comparison table above, it is clear that GridSearchCV took significantly longer than RandomizedSearchCV, yet RandomizedSearchCV still produced better recall values.
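The difference in run time follows from the number of model fits each search performs. A grid search fits every combination in the grid, while a randomized search fits only a fixed number of sampled combinations. The sketch below counts the fits for a hypothetical grid. The parameter values, the 5 folds and the 20 iterations are assumptions, not the settings used in the notebook.

```python
import itertools
import random

# Hypothetical search space for a boosting model.
param_grid = {
    "n_estimators": [50, 100, 150, 200],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4, 5],
    "subsample": [0.7, 0.8, 0.9, 1.0],
}


def grid_candidates(grid):
    """Every combination of parameter values (what a grid search evaluates)."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in itertools.product(*(grid[k] for k in keys))]


def random_candidates(grid, n_iter, seed=0):
    """n_iter randomly drawn combinations (this sketch may repeat a combination)."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in grid.items()} for _ in range(n_iter)]


cv_folds = 5
grid_fits = len(grid_candidates(param_grid)) * cv_folds
random_fits = len(random_candidates(param_grid, n_iter=20)) * cv_folds
```

With these assumed settings the grid search needs more than ten times as many fits. The randomized search can still reach comparable or better recall when only a few of the parameters matter much.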

Actionable Insights & Recommendations

From the findings above, the three factors with the greatest impact on customer attrition are:

  1. Total transaction count over the last 12 months of the customers
  2. Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter
  3. Total transaction amount over the last 12 months of the customers

Additional factors to pay attention to include:

  1. Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
  2. Total_Relationship_Count: Total no. of products held by the customer
  3. Months_Inactive_12_mon: No. of months inactive in the last 12 months
  4. Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter
  5. Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

Hence, the actionable insights and recommendations include:

  1. Monitor, or build a detector on, the total transaction count and amount over the last 12 months (this could be shortened to 10-11 months for earlier detection) to find customers who may renounce their credit card. A threshold can be set, and if a customer's numbers fall below it, the bank can approach the customer with special promotions or offers for other products.
  2. Monitor, or build a detector on, the ratios of total transaction count and amount in the 4th quarter to those in the 1st quarter to find customers who may renounce their credit card. A threshold can be set, and if a customer's numbers fall below it, the bank can approach the customer with special promotions or offers for other products (a New Year's promotion might help).
  3. As Total_Revolving_Bal is a factor, the bank can introduce a step-up bonus, given to customers when the balance in a month is a certain amount above the balance in the month before. This should make customers more loyal and more willing to put money in the bank.
  4. The bank can also introduce more products to customers who hold few products. Successful marketing will make customers more attached to the bank and more willing to use the products and services the bank offers, including credit cards.
  5. Monitor, or build a detector on, the number of months a customer was inactive in the last 12 months (this could be shortened to 10-11 months for earlier detection) to find customers who may renounce their credit card. A threshold can be set, and if a customer's numbers rise above it, the bank can approach the customer with special promotions or offers for other products.
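The threshold-based detectors recommended above could be prototyped as a small rule check. The sketch below is one possible shape for it. The class name, the thresholds and the customer values are all hypothetical, and real thresholds would need to be derived from the data and agreed with the business.

```python
from dataclasses import dataclass


@dataclass
class CustomerActivity:
    client_num: int
    total_trans_ct: int          # transactions in the last 12 months
    total_ct_chng_q4_q1: float   # Q4 / Q1 transaction count ratio
    months_inactive_12_mon: int  # inactive months in the last 12 months


# Hypothetical thresholds, for illustration only.
MIN_TRANS_CT = 40
MIN_CT_CHANGE = 0.6
MAX_MONTHS_INACTIVE = 3


def churn_warnings(customer):
    """Return the reasons, if any, for which a customer should be contacted."""
    reasons = []
    if customer.total_trans_ct < MIN_TRANS_CT:
        reasons.append("low transaction count")
    if customer.total_ct_chng_q4_q1 < MIN_CT_CHANGE:
        reasons.append("falling Q4/Q1 transaction count")
    if customer.months_inactive_12_mon > MAX_MONTHS_INACTIVE:
        reasons.append("long inactivity")
    return reasons


# Hypothetical customers.
customers = [
    CustomerActivity(1001, 72, 0.85, 1),
    CustomerActivity(1002, 31, 0.42, 2),
    CustomerActivity(1003, 55, 0.70, 4),
]
flagged = {c.client_num: churn_warnings(c) for c in customers if churn_warnings(c)}
```

A rule check like this is easy to explain to account managers and could run alongside the tuned model, whose predicted churn probability would give a second, independent signal.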